Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs: Location Provider Documentation #1537

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

smaheshwar-pltr
Copy link
Contributor

@smaheshwar-pltr smaheshwar-pltr commented Jan 18, 2025

(See below for screenshots)

Closes #1510. This is my first time writing docs here! Happy to receive style feedback - I already suspect I've written too much.

cc @kevinjqliu @Fokko

@smaheshwar-pltr smaheshwar-pltr marked this pull request as ready for review January 18, 2025 16:16
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. |
| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths |
| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |
| `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom LocationProvider](configuration.md#loading-a-custom-locationprovider) implementation |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure about what Options to put here. Happy to take suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Don't love how this looks. I prefer what it is now:

image

I've changed the section linked above to have mymodule.MyLocationProvider in custom Catalog style. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The above screenshot also shows how code/backticks hyperlinks look, I think they're fine. This is now relevant because of #1537 (comment).


### SimpleLocationProvider

The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realised I was wrong in #1510 (comment) about docs not needing to change when write.data.path is supported. I think we do want to say that data files are under a data directory, here and below.

But I think this is fine because the change will be small - it'd just be "write.data.path" instead of "data directory under the table's location" throughout. write.data.path defaults to this.


The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`.

### ObjectStoreLocationProvider
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of natural duplication between this section and https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout. I've gone less in-depth here though.

I was unsure whether to link to this webpage (and if so, how to word it) because there's a lot that's not relevant to us, e.g.

Note, the path resolution logic for ObjectStoreLocationProvider is write.data.path then /data. However, for the older versions up to 0.12.0, the logic is as follows: - before 0.12.0, write.object-storage.path must be set. - at 0.12.0, write.object-storage.path then write.folder-storage.path then /data.

and

Previously provided base64 hash was updated to base2 in order to provide an improved auto-scaling behavior on S3 General Purpose Buckets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should link to that for additional context

s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

### Loading a Custom LocationProvider
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -30,7 +30,12 @@


class LocationProvider(ABC):
"""A base class for location providers, that provide data file locations for write tasks."""
"""A base class for location providers, that provide data file locations for a table's write tasks.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added these docs so folks looking to write a custom LocationProvider see

image

### Loading a Custom LocationProvider

Similar to FileIO, a custom LocationProvider may be provided for a table by concretely subclassing the abstract base
class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The
Copy link
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Jan 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to link to this, and this works for me locally, but I get the following warning when serving docs locally:

INFO - Doc file 'configuration md' contains an unrecognized relative link '.. / reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider', it was left as is. Did you mean ' reference/pyiceberg/table/locations.md#pyiceberg.table.locations.LocationProvider'?

But I get a similar warning elsewhere: Doc file 'SUMMARY.md' contains an unrecognized relative link 'reference/', it was left as is.. So I think this is fine?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea i think its fine, as long as the hyperlink works when you run it locally

@smaheshwar-pltr
Copy link
Contributor Author

smaheshwar-pltr commented Jan 18, 2025

Here are screenshots of (current) changes. I've verified that all the new links work.

(Edit: I've since made minor spelling changes. After this PR is reviewed, I'll post the up-to-date screenshots, so folks can see what'll be merged, instead of posting on every change)

image image image image

@smaheshwar-pltr smaheshwar-pltr changed the title Documentation for Location Providers Docs: Location Provider Documentation Jan 18, 2025
An example, custom `LocationProvider` implementation is shown below.

```py
import uuid
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only shown this import for conciseness.

| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk |
| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group |
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. |
| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should add a warning or something about how this default differs from the java implementation

The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories,
into file paths, to distribute files across a larger number of object store prefixes.

Paths contain partitions just before the file name and a `data` directory beneath the table's location, in a similar
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1537 (comment) re data here.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! These docs are great. I've added a few nit comments

| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. |
| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths |
| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |
| `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom LocationProvider](configuration.md#loading-a-custom-locationprovider) implementation |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk |
| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group |
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. |
| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should add a warning or something about how this default differs from the java implementation

| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group |
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. |
| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths |
| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hyperlinks are weird sometimes, can you make sure that these work as intended

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked all hyperlinks on the current version and they work as intended

mkdocs/docs/configuration.md Outdated Show resolved Hide resolved

### SimpleLocationProvider

The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example,
The `SimpleLocationProvider` places a table's data file names underneath a `data` directory in the table's storage location. For example,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also call out that the "base location" is the table.metadata.location

From the spec, https://iceberg.apache.org/spec/#table-metadata-fields

The table's base location. This is used by writers to determine where to store data files, manifest files, and table metadata files.

s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key
When the table is partitioned, files under a given partition are grouped into a subdirectory, with that partition key

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe also when the hive-style partition path

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully this and this is wha you meant

Comment on lines +254 to +256
```txt
s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: what about giving an example of this set to True and another one set to False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the False case just above ("the same data file above" here) - or do you mean making that more explicit?


The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`.

### ObjectStoreLocationProvider
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should link to that for additional context

### Loading a Custom LocationProvider

Similar to FileIO, a custom LocationProvider may be provided for a table by concretely subclassing the abstract base
class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea i think its fine, as long as the hyperlink works when you run it locally

class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The
table property `write.py-location-provider.impl` should be set to the fully-qualified name of the custom
LocationProvider (i.e. `module.CustomLocationProvider`). Recall that a LocationProvider is configured per-table,
permitting different location provision for different tables.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also mention that java uses write.location-provider.impl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Documentation for Location Providers
2 participants